Indexed Searching on Proteins Using a Suffix Sequoia

Author

  • Ela Hunt
Abstract

Approximate searching on protein sequence data under arbitrary cost models is not supported by database indexing technology. We present a new data structure, the suffix sequoia, which reduces the time complexity of the dynamic programming (DP) matrix calculation required in approximate matching. The data structure is compact, using just over 4 bytes per indexed symbol. We show that the time complexity of the DP calculation is O(qg) for a pattern of length q, alphabet size g, and indexing window size d. The DP calculation requires no disk access and can be executed efficiently. The second phase of the algorithm is based on sequential disk access and appears to be effective. Approximate matching experiments are promising and leave considerable scope for algorithm refinement and data structure engineering.
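For orientation, the DP matrix calculation mentioned in the abstract can be illustrated with the textbook unindexed baseline. The sketch below is not the suffix sequoia and is not taken from the paper; it is a minimal column-by-column DP for approximate matching under a caller-supplied cost model. The names `approx_match_ends`, `unit_cost`, `sub_cost`, `gap_cost`, and `max_cost` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def unit_cost(a: str, b: str) -> int:
    """Illustrative cost model (an assumption): 0 for a match, 1 otherwise."""
    return 0 if a == b else 1


def approx_match_ends(
    pattern: str,
    text: str,
    sub_cost: Callable[[str, str], float],
    gap_cost: float,
    max_cost: float,
) -> List[Tuple[int, float]]:
    """Return (end_index, cost) for each text position where some
    substring ending there aligns to `pattern` with cost <= max_cost."""
    q = len(pattern)
    # prev[i]: cheapest cost of aligning pattern[:i] against a substring
    # of the text ending just before the current text symbol.
    prev = [i * gap_cost for i in range(q + 1)]
    hits: List[Tuple[int, float]] = []
    for j, t in enumerate(text):
        curr = [0] * (q + 1)  # curr[0] = 0: a match may start anywhere
        for i in range(1, q + 1):
            curr[i] = min(
                prev[i - 1] + sub_cost(pattern[i - 1], t),  # align symbols
                prev[i] + gap_cost,       # text symbol left unaligned
                curr[i - 1] + gap_cost,   # pattern symbol left unaligned
            )
        if curr[q] <= max_cost:
            hits.append((j, curr[q]))
        prev = curr
    return hits
```

This baseline performs O(q) work per text symbol, so scanning a whole database costs time proportional to q times the database length. For proteins, `sub_cost` could be replaced by a function derived from a substitution matrix; that choice is left to the caller here.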


Similar resources

PSISA: An Algorithm for Indexing and Searching Protein Structure using Suffix Arrays

Protein Structure Indexing using Suffix Array (PSISA) is a new technique that makes it possible to retrieve similar proteins based on their structures. Indexing the protein structure is one approach to searching for protein similarities. In this paper we develop our proposed technique based on a novel use of suffix arrays. We start by converting protein structure into a sequence by ...

The Suffix Sequoia

Standard technologies for sequence searching do not use database indexes. These solutions can be divided into exhaustive algorithms, e.g. the Smith-Waterman algorithm [11], and heuristic ones, like BLAST [1, 2], FASTA [10], and BLAT [7]. Specialised tools for DNA matching exist, such as SIM4 [3] and SSAHA [9]. Only BLAT and SSAHA use indexing. BLAT can be used with proteins, however, its sensit...

Protein Structure Searching using Suffix Arrays

Searching for protein similarities using structure-based queries plays a vital role in many applications, such as drug discovery and design, disease diagnosis and treatment, and protein classification. Indexing the protein structure is one approach to searching protein structures for similarities. In this paper we propose a method to enhance the memory space for storing the indexed data witho...

String matching with alphabet sampling

We introduce a novel alphabet sampling technique for speeding up both online and indexed string matching. We choose a subset of the alphabet and extract the corresponding subsequence of the text. Online or indexed searching is then carried out on the extracted subsequence, and candidate matches are verified in the full text. We show that this speeds up online searching, especially for moderate ...

PSIST: A Scalable Approach to Indexing

Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure have important applications such as protein structure and function prediction, protein classification and drug discovery. In this paper, we develop a new method for extracting local structural (or geometric) features from protein structures. These feature vectors are in t...

Journal:
  • IEEE Data Eng. Bull.

Volume 27

Publication date: 2004